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Abstract 

The number of extant individuals within a lineage, as exemplified by 
counts of species numbers across genera in a higher taxonomic category, is 
known to be a highly skewed distribution. Because the sublineages (such 
as genera in a clade) themselves follow a random birth process, deriving 
the distribution of lineage sizes involves averaging the solutions to a birth 
and death process over the distribution of time intervals separating the 
origin of the lineages. In this paper, we show that the resulting distribu- 
tions can be represented by hypergeometric functions of the second kind. 
We also provide approximations of these distributions up to the second 
order, and compare these results to the asymptotic distributions and nu- 
merical approximations used in previous studies. For two limiting cases, 
one with a relatively high rate of lineage origin, one with a low rate, the 
cumulative probability densities and percentiles are compared to show 
that the approximations are robust over a wide rane of parameters. It 
is proposed that the probability density distributions of lineage size may 
have a number of relevant applications to biological problems such as the 
coalescence of genetic lineages and in predicting the number of species in 
living and extinct higher taxa, as these systems are special instances of 
the underlying process analyzed in this paper. 
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1. Introduction 



The number of individuals in a genetic lineage (family size) and the formally 
equivalent problem of the number of species within genera or other higher taxa 
has been of wide interest to demographers and biologists alike (e.g. Watson 
and Galton 1875, Yule, 1924, Anderson 1974, Burlando 1990). Empirically, the 
distributions of family sizes in humans and taxon sizes in a number of organ- 
isms have been documented to be consistent with the distributions predicted by 
simple stochastic models. 

The most straightforward approximation to the distribution of lineage or 
taxon sizes is the pure birth process, which entails a certain probability of 
an individual giving birth or one species giving rise to another per unit time, 
ignoring death or extinction. In its details, this is clearly an unrealistic model, 
but it does provide a reasonable first order approximation for the diversification 
of higher taxa during their early history, when extinction is infrequent, or in 
lineages (such as small populations of bacteria growing on a rich medium) where 
the death rates are very small compared to birth rates. 

A basic model (Yule 1924, Feller 1950, Bailey 1968) represents a pure birth 
process with birth probability A per unit time. In this model, the instantaneous 
rate of change in the probability of observing n individuals is given by the 
following differential equation 

= A(n - l)f>„_i - Xnp n . (1.1) 

Under the initial condition pi(0)=l (a single ancestor for the lineage at time 
zero) and for all n >1, this differential equation has the solution 

p n (t)=e Xt (l-e Xt ) n - 1 . (1.2) 

More realistically, if there is an intrinsic death rate (or, for species and 
higher taxa, extinction rate) /x, then one has a time-homogeneous birth and 
death process defined by the differential equation, e.g. Kendall (1948), Darwin 
(1956), Keiding (1975), 

= A(n - l)p n -i - (A + \i)np n + fi(n + l)p n +i for n > 

dpo n Q s 

-& = m- (i-3) 

The solutions for A ^ /j, given p 1 (0)=l are given by 



Pn(t) 



(A - [i) 2 exp[-(A - n)t] fX - Acxp[-(A - fi)t] 



(A — ^exp[— (A — /i)i]) 2 \A — /iexp[— (A — /i)t] 



for n > 



Po(t) 



H — /iexp[— (A — n)t] 
A — /icxp[— (A — [x)t] 



(1.4) 
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In the special case where A=/i, the solutions simplify to: 



Mt) = ^ • (1.5) 

For the purposes of many empirical studies, in which one docs not know the 
precise number of once-extant lineages which are currently extinct (i.e. n =0 
individuals in the lineage or species in the clade at time t, with an indeterminate 
number in previous time intervals), it is useful to work with the truncated 
distributions describing the probability densities for strictly positive values of 
n, i.e. 

Pn(t)= -^-forn>l 
1 -Po{t) 

which, using the expressions in (1.4), evaluates to 



= P-M)exp[-(A-^)t] / A - Aexp[— (A - fi)t] \ " {mn>1 
A — /uexp[— (A — fj,)t] \ A — /xexp[— (A — /i)t] J 

(1-6) 

while for A=/U, 

P n (t) = ,1 A ^" * for n > 1. (1.7) 
w (Ai + 1)" ~ v ; 

Strictly speaking, in order to compare the distributions of family sizes to 
those predicted in (1.4) or (1.6), or to infer birth and death rates using maxi- 
mum likelihood estimators, the lineage ages must be known. This requirement 
becomes particularly problematic when one compares the number of individu- 
als in sublineages, for example the number of species in different genera within 
a taxonomic family or order. Because the sublineages arose at different times 
(i.e. obviously, not all genera within a clade originated at the same moment), 
comparisons of the number of species in a genus or any other subclade must be 
weighted by the differences in subclade ages. 

When the age of a lineage is known, as is the case for phylogenetic trees 
with branch lengths that have been callibrated using fossil data and "molecular 
clocks" (sensu Zuckcrkandl and Pauling 1965, Kimura 1980), the number of 
individuals or species in each lineage can be compared to the predictions of 
equation (1.4) by applying the appropriate time values. For phylogenies with 
known branch lengths and resolved split times, Harvey et al (1994) and Nee et al 
(1994, 1995) presented a method whereby the origination and extinction rates 
A and /i could be estimated using maximum likelihood methods. A similar 
approach, not requiring conditioning on the age of the first split, is given by 
Rannala (1997) and in Fclscnstcin (2004, pgs. 564-9). 
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Such detailed data is not reliably available for the majority of phylogenctic 
trees or gene genealogies, and when it is, the confidence intervals on the age 
estimates can be very wide. To address the problem of rate estimation when 
molecular clock estimates are absent or unreliable, Reed and Hughes (2002) 
derived distributions of genera sizes weighted by the time intervals separating 
generic branching events. Under the assumption that within a lineage (clade), 
subclades such as genera within families arise at a rate p, the authors used 
an exponential approximation for the distribution of the generic ages. This 
representation is not limited to any taxonomic group, as it applies to any lineage 
in which constituent sublineages are defined by some unique characters, genetic 
markers, or in the case of humans, family names. 

The time at which a genus within a clade of age r arises is given by 



/(*) = 7^7 (1-8) 



which, for t — > oo, gives 



f(t) = pe->>\ (1.9) 

Using (1.9) to weight the probability distributions p n at time t, Reed and Hughes 
defined 

q n (r) = f ppni^e-p'dt . (1.10) 
Jo 

Reed and Hughes derived a generating function and an algorithm for calculation 
of qn(r) recursively from a limiting case (when n^> oo). It will be shown that 
given certain choices of changes in variables, a closed form expression for this 
integral can be evaluated, and a number of useful approximations to the <7«(t) 
distribution can be made, of which the asymptotic approximation calculated by 
Reed and Hughes is an important limiting case. 



2. Results 

Pure Birth Process 

We will begin with the distribution for a pure birth process, which is a reasonable 
approximation when A ^> p,. 

q n (r) = f p exp[-(\ + p)t]{\ - exp[-\t]) n - l dt . (2.1) 
Jo 

Under the assumption that the lineage is ancient relative to the age of any 
sublineage, we take the limit as r — > oo. Then, applying a change of variables 
y = e At , the resulting integral is 

q n = P C y p/x (l-v) n - l dy (2.2) 
Jo 
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which, for n>l, evaluates to 

q n = Beta(p/A + 1, n) = ^|^^y ■ (2-3) 

Using an asymptotic expansion of the gamma ratios involving n above (e.g. 
Exton, 1978, Luke 1969), we get the following: This can be further simplified 
to an expansion of the ratios of gamma functions, up to the order n~ 2 , 



r(p/A + i) 
nP /\+i 



l - 



+ 



jO+j) 

2n 

(l+lKg±i) 

Sin 2 



3(f + l) 2 + f +0(n" 3 ) . 



It follows from the above that, as oo, the probability can be approximated 
by 



T(p/A + l)n 



-p/X-l 



(2.4) 



Birth and Death Processes 

As in the analyses of Reed and Hughes (2002), we consider three different birth 
and death models. The first, with A > p (birth rate higher than death rate), is 
perhaps the most biologically significant, because this is the only birth and death 
process in which a lineage has a nonzero probability of persisting indefinitely. 

We will evaluate the function Q n for the truncated solutions to the birth and 
death process given by equation (1.6), because most of the relevant empirical 
data involves n >0. For convenience we use the parameters c<j=A— /i and 0=p/ A, 
so that when the birth rate is greater than the death rate, u> >0 and 9 <1. The 
integral over the truncated distribution can be written as 

_ r°° puj exp[-(u + p)t] ( 1 - exp[-ut\ A"" 1 
^ n J l-0exp[wt] \\-6exv\-wt]) 
Applying the changes of variable 

l- exp[ -ut] (l-g)dy 
UK> \-6exp\-ujtY u{l-y){l-e y ) 

and noting that the integral limits are y(0) — and y(oo)=l, we obtain 

Qn = P (l - 0) [ y n -\l - y y/"{l - Oyy/^dy. (2.5) 
Jo 

If we do the same for Equation (1.4) by integrating over the part of the distri- 
bution where n>l, an expression of similar form is derived 
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q n = ^—^ [ y n ~\± - y) p/uJ (i - Oy)- p/ul d y . (2.6) 
w Jo 

The two integrals given above evaluate to constant multiples of a hypergeometric 
function (Exton 1978, Luke 1969), which has the integral representation 



L 



1 y a ~\l - y) c - a -\l - 9 Y )- b dy = J )na) 2 F 1 (a, b, c, 0) (2.7) 



r(c) 

and 2F1 is the Gauss hypergeometric function that is defined by the series 



( a )k(b)k ak 

2-riyu, u, u, u) = / j 
fe=0 

where 



,F 1 {aAc,e) = Y J ^f^O k (2.7a) 



{a) k = a(a + l)...(a + k - 1) = ^y^p- 

(Exton 1978, Luke 1969). In equations (2.5) and (2.6), we have a = n and 
b=l+p/uj. In the first instance, c = n+p/w+l, and c = n+p/w+2 in the 
second. Using (2.7) and (2.7a) in (2.5) we obtain 

00 

Q„=p(l-tf)r(l + 5)A£(l + 5) feB -. (2.8) 

k=0 

The coefficients A = A(p, u, n) and B = B(p, u, k, n) are given by 

m a r (") 



B(p,uj 7 k, n) 



r(n+£ + l)' 
(n)fc r(n + /c) T(n + p/w + 1) 



(n + p/cj + l) fe T(n)T(n + k + p/iu + 1) 

It is well known that the ratios of Gamma functions can be expanded in terms 
of 1/n, see for example Luke (1969). The expansion is in terms of Bernoulli 
polynomials, see also Moschopoulos and Mudholkar (1983) for similar asymp- 
totic expansions of hypergeometric functions. The approximations to follow are 
lengthy and rather tedious, but nevertheless straightforward. 

Second order approximations to the terms A and B are given by: 

A . „-(»*, ( 1 _iii±i) + e+fl(»+s>(»(fl'+s)) x (1 + 0( „- S)) 

I 2n 24n 2 1 

To solve for B, we use the following: 
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T(n + k) 



n k 1 + 



jfc(fc-l) fc(fc-l) (3(fc-l) 2 -fc-l) 



2n 



+ 



24n 2 



) x(l + 0(n- 3 )) 



r( n +£ + fc+l) _ fc / fc(fc + gg + l) fc(fc-l)(3(fc+^ + l) 2 -fc-l) 
r fn + £- + 1) ~ n 1 + 2n + 24n 2 



r(n + £ + i) 

x (l + Ofn" 3 )) 



r( n+ £+fc + l) 

r(n + 5 + i) 



n k \ 1 + 



fc(fc+^ + l) fc(fc-l) (3 (fc+^ + l) 2 -fc-l) ' 
2n + 24n 2 



(l + 0(n- 3 )) 



which allows one to rewrite B as a series expansion: 

B=(l+ fc(fc ~ 1} + k{k ~ 1 ] ( 3(fc ~ 1)2 ~ k ~ X ) 



2n 



24n 2 



I i + t(t+ ," +1 + W-W + g + i)'-*-')) "' x (1 + ^ 



2n 



24n 2 



= 1 - - (l + + -I^p, W) fe) + 0(n- 3 ) 

o2 



where 



Collecting terms, we obtain the following approximation up to order n 



Q n = p(l - 6)r (l + n- 1 -^ 



2n 24n 2 



fc=0 



o,w,fe) <9 fc 

^2 "fcT 



(l + 0(n- 3 )). (2.9) 



The terms in the above approximation underscore the fact that the shape of the 
distribution Q n is determined by the magnitude of p/ui, the ratio of the rates at 
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which new lineages (genera) arise to the net birth rate of individuals or species. 
This is because in the limit as t — > oo, the absolute magnitudes of the two 
rate parameters do not contribute to the ratio of lineage number to lineage size. 
Basically, a small value of pjuj implies that over sufficient time, there will be 
a comparatively low number of lineages rich in individuals, while a large value 
results in many lineages with relatively few individuals. 
It is also remarked that the series terms 

fe=0 k=0 

have to be computed numerically, just as the hypergeometric function must 
be estimated using numerical methods. As there is no closed-form expression 
for equation (2.9), the value of these approximations is to show the rate of 
convergence of first and second order terms (as functions of n -1 ) to the exact 
solution for different birth and death rates. 

In contrast, a closed form expression does exist for the asymptotic limit, as 
n — > 00. Unlike the terms involving higher powers of k in (2.9), the first series 
term within the parenthesis converges to 

By ignoring all factors of higher order than n _1 , we have 

Q n w p{\ - ef- p ' w T (l + n-P/"- 1 (2.10) 

as in Reed and Hughes (2002), apart from the constant factor of 1/u) for the 
truncated distribution. It is noteworthy that in equation (2.10) the term 9 only 
appears as a constant factor, while in (2.10) its magnitude determines the rate 
at which the series term converges. The efficacy of both the asymptotic and 
nT 2 approximations will be investigated numerically in the next section, where 
they are compared to the exact hypergeometric function solutions. 

We next direct our attention to the case of a "declining" linage where A < p,, 
i.e. when the birth or speciation rate is less than the death or extinction rate. 
In this scenario, the parameter values u> <0 and 6 <1. Using the same changes 
of variables as in the derivation of equation (2.6), the integral defining the 
probability density is 

,1/0 

Q n = p(l-6) y n - 1 (l-y) p / u (l-9y)- p ' u - 1 dy. (2.11) 
Jo 

With a further change of variables, z=9y, we again derive an expression with 
integral limits of and 1, 

Q n = p(l - 0)9 1 - n J z"- 1 (l - |) (1 - zy/^dz . (2.12) 
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The solution to this integral is a hypergeometric function of a form similar to 
that of equation (2.6). Here, the parameters are a=n, h=-p/w, and c= n-p/w+l. 
In place of 6 in the series expansion, we have 1/6, so that (2.11) ultimately 
evaluates to 

g B = p(i-*)*-r(-£) Aj2(-E.) k B-. (2 . 13) 

fc=0 

Note that because u) <0, — p/u >0. For the above distribution, the coefficients 
A and B are 



A - m B (p U kn)- {n)k _ r(" + fc)r(n-£ + l) 
T (n - £) ' ^' W ' *' n) ~ (n - £ + 1) k T(n) I> + k - * + 1) " 

An approximation in terms of leading orders of 1/n is possible for (2.13) 
using the same approach as was used in (2.9) i.e. by rewriting the Gamma 
functions as truncated polynomial ratios. 

For the A, the second order approximation is: 



„ £(£ + !) (£-l)(£-2)(3(l + £) 2 -£- . 
A = „= [ 1 - + ^ ^ | x (1 + 0(„-3)) 

To evaluate B, we have the same ratio T(n+k)/r(n) in the numerator as in 
the case of A > p, while the denominator term Y( 1-p/ui+k)/ T{n-p/uj) is: 




2n 24n 2 V 



(l + 0(n- 3 )) 



Using the series expansion of coefficients A and B, we obtain the following second 
order approximation for equation (2.13), 

Q n = p(l - 6)6^ r(-£)n£ 

5(^ + 1) (^-l)(^-2)[3(l + 5) 2 -5-l 

LP V UJ /_ _|_ L . 



2n 



24n 2 



fc=0 



<t>{p,u,k) (6*_ 
k\ 



0(n- 3 ). (2.14) 



Recall that w <0, so that the coefficients -p/u> are positive. 

The asymptotic approximation to this distribution, calculated for n — > oo, 
is given by the following equation: 
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Q n » P 6 n (l - ef-p'^Y (l + n-P^- 1 . (2.15) 

The special case where A=/i (exactly zero net growth rate) is not likely to occur 
in biological systems, therefore we do not analyze this scenario in detail. In 
those rare instances where net growth rate is close to zero, the distributions 
can be approximated by using the equations for the cases where growth rate is 
positive and negative and taking limits as p, approaches A. 

A Numerical Assessment of the Approximations 

The accuracy of the second order (2.9) and asymptotic approximations (2.10) are 
assessed by comparing their probability densities to those of the exact solutions 

(2.8) for a range of parameters values. The examples considered here are the 
biologically most relevant case where the birth rate is higher than the death 
rate, such that u >0 and 9 <1. 

Attention is focused on two parameters: p/co and 9. The first is the ratio of 
the rate at which new lineages arise to the net birth rate of individuals or species 
within that lineage. As was discussed above in connection with the distribution 

(2.9) , the shape of the probability density function depends principally on the 
ratio p/oj. When pjuj <Cl, the distribution will be characterized by very few 
lineages with large numbers of individuals, while as p/u> — »1, the distribution 
will be skewed to the left due to the presence of many lineages with few (perhaps 
only a single) individual. The accuracy of the approximation is examined for 
different values of this ratio, so that in one instance p and u> are of approximately 
the same magnitude, while in another p and uj differ by an order of magnitude. 

It is expected from the summation term in equation (2.9) that the second 
order approximation should be most accurate when 9 -Cl, corresponding to a 
low birth rate relative to the death rate, because the series converges rapidly 
when 9 is near zero. To compare the accuracy of the approximations as a 
function of the rate parameters, we will consider cases where of 0=0.01, 0.1, 
and 0.4. The predictions on the accuracy of the respective approximations are 
given qualitative support by Figures 1 and 2, which plot the cumulative density 
functions 

n 
JV=1 

for the exact solution Q n given by equation (2.8), the order n~ 2 approximation 
of equation (2.9), and the asymptotics in equation (2.10). In the first set of 
figures, the rates at which new lineages (e.g. "genera") are born is of the same 
order of magnitude as the net birth rate, i.e. p=0.02, w=0.05, while in the 
second set, there is a tenfold difference between the birth rates of lineages and 
individuals, with p=0.01 and u=0.1. 

INSERT FIGURES la-c and 2a-c HERE 
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The relationship between the size of 9 and the accuracy of approximation 

(2.9) can be seen by comparing Figures la and 2a with lc and 2c. In Figures 
la and 2a, 0=0.01, so that the second order approximation performs better 
than the asymptotic approximation. The exact solutions and second order ap- 
proximations start to diverge when 0=0.1, both in Figures lb and 2b. For the 
case where 0=0.4, the second order approximation is quite poor, being consis- 
tently outperformed by the asymptotic approximation. This is because equation 

(2.10) , unlike (2.9), has a closed form expression and does not itself have to be 
estimated numerically. 

As predicted, the asymptotic approximation begins to break down when the 
net birth rate (of species or individuals) as p, the rate at which new sublineages 
(family groups in a genealogy or genera in a clade) arise, approaches the value 
of u>, the birth or speciation rate. This is made apparent by comparing the 
difference between the exact solutions and the asymptotic approximations in 
Figures la-c (where p is of the same magnitude as ui) with those shown Figures 
2a-c, where p is an order of magnitude smaller than u>. The greater deviation 
between the asymptotic and the exact solutions is due to the fact that when 
p/uli is reasonably large, the coefficients of the n _1 and n~ 2 terms cannot be 
ignored, except when n is very large (of the order of ~10 3 or greater), so that 
the second order approximation can match the exact distribution quite closely 
given parameter values where the asymptotic approximation is inaccurate. 

A more precise assessment of the approximation accuracies can be made by 
comparing cumulative distributions functions. These are given in Table 1 for 
n=10, 100, 200, 500, 1000, and 10000 and are compared to exact and asymptotic 
values for the same set of birth and death rates that were shown in Figures 1-2. 
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Table 1. Prob.[A(i) < n | £,0] 
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For the smaller values of 8, the cumulative probabilities of the second or- 
der approximation are very close to the exact distribution, while for 0=0.5, the 
asymptotic is better. As was seen in the graphs, the asymptotic cumulative 
probabilities diverge for p of the same order of magnitude as ui. The asymptotic 
approximation are independent of 8, because this value only appears as a con- 
stant factor in equation (2.10), while it is a parameter in the hypergeometric 
function in (2.8) and the approximating sum in (2.9). 

Another measurement of the approximation was based on a numerical calcu- 
lation of percentiles (the calculations were implemented using the Mathematical, 
software package, with scripts available from the corresponding author by re- 
quest) for each distribution, for p=0.05 and p=0.01. The percentile values are 
shown in Table 2: 
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Table 2. Upper 95th and 99th Percentiles Percentiles Given 
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As predicted from the previous results, it can be seen that the critical values 
n* corresponding to each percentile are very similar for the exact and second 
order approximate distributions when 9 <Cl. Similarly, the critical values n* 
between the exact and second order diverge for larger 9, just as those of the 
asymptotic and exact distributions diverge for larger values p/u. 

Taken together, these examples illustrate the efficacy of the approximations 
under a range of rate parameter values. It remains to be determined whether 
the assumptions made in the derivations are realistic or sufficiently accurate for 
the biological processes being modeled. 

3. Discussion 

The results derived in this paper give analytical predictions for the number of 
individuals per lineage in a distribution of lineages given a time-homogeneous 
birth and death process, weighted by a time-homogeneous birth rate for sub- 
lineages. An obvious application of these distributions is to the problem that 
motivated this and similar earlier studies - the number of species within a genus 
or some higher taxonomic category. With modern techniques of phylogenetic 
inference, monophylctic subclades of any clade of interest can be identified (to- 
gether with reasonable bounds on age estimates in many instances) to yield 
an empirical frequency distribution of subclade sizes. These frequency distri- 
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butions can then be compared to the probability densities predicted from the 
exact solutions Q n . 

It is of particular interest to look at "anomalously" species rich subclades 
and evaluate the probability of their occurrence based on a null model, some- 
thing which can readily be done by computing the extremal probabilities. For 
example, if the average rate at which new genera arise in a sufficiently extant 
clade is 0.02 per million years, the net speciation rate w=0.05, and the ratio 
of extinction to speciation rate is 0.1, then it can be seen from the appropri- 
ate entry in Table 1 the probability of encountering a genus with over 10000 
species is less than 0.025 according to the distributions derived under the as- 
sumptions of time homogeneity If such unusually large genera are encountered, 
that may indicate that the birth-death process driving the distribution is in fact 
not time-homogeneous. 

Another application of these results to phylogenetic data is the estimation of 
speciation and extinction rates from taxon size distributions. The most straight- 
forward, and by far the most accurate, means of estimating these parameters 
is from direct paleontological data on origination and extinction rates, which 
is rarely available at the level of detail required for likelihood based estimates. 
The approach of Harvey and colleagues (e.g. Harvey et al 1994, Nee et al 1994) 
does not require fossil data, relying instead upon molecular clock estimates of 
subclade ages. In addition to the fact that the confidence intervals on these 
estimators tend to be very large (Paradis, 2004), there is the fact that reliable 
clock data is not available for many taxa. Consequently, the distributions in 
equations 2.8-2.10 can be used to give numerical estimates of the parameters, 
much as Reed and Hughes (2002) obtained recursively and in asymptotic ap- 
proximations. 

The underlying processes for which the distributions were derived are not 
limited to phylogenetic data on species abundance. Indeed, any process gener- 
ating new entities defined as lineages or families, that are themselves composed 
of individual entities undergoing birth and death, can be analyzed using the 
models presented in this paper. For example, gene genealogies within popu- 
lations and species often have constituent lineages that are defined by a set 
of point mutations. This is of course the foundation of coalescent based ap- 
proaches to population genetics, whereby the history of lineages can be inferred 
from sequence data (e.g. Kingman 1982, Hudson 1991, Hein et al 2005, Wake- 
ley 2008). The diversity of genotypes and lineages is produced by a branching 
process where the birth and death rates can be parameterized by A and /i, while 
the mutation rate defining new lineages in infinite sites or infinite alleles models 
can be used as a measure of p. In this way, haplotype distributions in popula- 
tion genetics can be analyzed in the same way as the taxonomic data described 
above, allowing one to determine whether the disproportionate representation 
of certain alleles or haplotypes is consistent with a random branching process, or 
whether the haplotype distribution must be explained by processes other than 
mutation and genetic drift. 

The correspondence between the data and the analytical results in all of these 
cases will to a large part be limited by a number of simplifying assumptions made 
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in the derivations. In particular, the distributions Q n were derived under the 
assumption that the time elapsed since the initiation of the process is very great, 
essentially infinite. Therefore, for comparatively young lineages (young with 
respect to the birth and death rates), one would expect the approximations to 
break down. These issues will be investigated using individual-based simulations 
and analysis of biological data in a forthcoming paper. 
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Figure Legends 

Figure 1: 

This set of figures shows plots of the cumulative density functions for the 
exact solution (Eq. 2.8), the second order approximation (2.9) and the asymp- 
totic approximation for the case where p=0.02 and w=0.05. 

a. Here 0=0.01, so that the second order approximation converges closely to the 
exact solution. The lower curve represents the asymptotic distribution, while 
the exact solution and second order approximation are nearly superimposed. 

b. For 6=0.1, the second order approximation remains close to the exact solu- 
tion. Note that the asymptotic is independent of 0, so that its divergence from 
the exact solution is constant in a-c. 

c. When 0=0.4, the second order approximation deviates significantly from 
the exact solution, even when the sum is taken to the 10 6 . This suggests that 
first and second order approximations are only robust when the death rate is 
sufficiently smaller than the birth rate. 

Figure 2: 

These figures show cumulative density functions Prob. [iV(t) < n | for 
the same distributions as in Figure 1, but in this instance p=0.01 and u)=0.1. 

a. Again, 0=0.01, so the second order approximation is very close. Furthermore, 
because p/oj is fourfold smaller than in Figure 1, the asymptotic approximation 
is much closer to the exact distribution. 

b. For 0=0.1, the second order approximation is still only minimally diver- 
gent. The asymptotic remains a robust approximation in this instance, being 
only dependent on p/ui. 
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c. When 0=0.4, the same is observed as in Figure lc. 

Table Legends 

Table 1: 

The cumulative probability densities Prob.[iV(i) < n \ £,6} of the exact 
solution (E), the second order approximation (S), and the asymptotic approxi- 
mation (A) arc shown for n=10, 20, 50, 100, 200, 500, 1000, 2000, and 10000, 
up to three significant digits. In the first half of the table, the ratio of lineage 
origination rate to net birth rate p/ui— 0.25, while p/ui^0.1 in the second por- 
tion. The densities are compared for #=0.01, 0.1, and 0.4, again showing the 
breakdown of the second order approximation for large 9, and of the asymp- 
totics for larger p/to. 

Table 2: 

The critical numbers of individuals n* corresponding to extremal probability 
densities p=0.05 and 0.01 are shown for the same parameter values and the same 
distributions as in Table 1. 
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